Abstract

Visual saliency, which predicts the regions in the field of view that draw
the most visual attention, has attracted a lot of interest from researchers.
It has already been used in several vision tasks, e.g., image classification,
object detection and foreground segmentation. Recently, the spectrum-analysis-based
approach to visual saliency has attracted a lot of interest due to its
simplicity and good performance; it uses the phase information of the image
to construct the saliency map. In this paper, we propose a new
approach for detecting spatiotemporal visual saliency based on the phase
spectrum of videos, which is easy to implement and computationally
efficient. With the proposed algorithm, we also study how spatiotemporal
saliency can be used in two important vision tasks, abnormality detection
and spatiotemporal interest point detection. The proposed algorithm
is evaluated on several commonly used datasets and compared
with state-of-the-art methods from the literature. The experiments demonstrate
the effectiveness of the proposed approach to spatiotemporal visual
saliency detection and its application to the above vision tasks.


1 Introduction 

In recent years, the modeling and detection of visual saliency has attracted a
lot of interest in the vision community. One early work that is widely known is
the approach by Itti et al. [1]. Since then, many different models have been
proposed for computing visual saliency. Such models may be roughly divided
into two groups: bottom-up (stimulus-driven) models that are mainly based
on low-level visual features of the scene, and top-down (goal-driven) models that
employ information and knowledge about a visual task. A survey of both groups
of methods was reported in [2]. Visual saliency analysis has been applied with
success to other vision tasks including object detection [3], image classification
[4] and foreground segmentation [5].

Recently, the spectral-based approach has gained increased interest due to its
simplicity and good performance. In [6], the spectral residual together with the
phase information was used to construct a saliency map. In [7] it was found that
it is the phase information rather than the spectral residual that leads to a better
saliency map. However, there was a lack of theoretical justification for such
methods until [8], where it was shown that, if the background is sparsely supported
in the DCT domain and the foreground is sparsely supported in the spatial domain,
the foreground will receive high values on the computed saliency map.

In the real world, the visual field of view of a human may constantly change,
and thus visual saliency often depends not only on a static scene but also on the
changes in the scene. To this end, spatiotemporal saliency has been proposed,
which tries to capture regions attracting visual attention in the spatiotemporal
domain. Spatiotemporal saliency has been applied to vision tasks such as video
summarization [9], human-computer interaction [10], video compression [11],
and abnormality detection [12].

In this paper we propose a novel spatiotemporal visual saliency detector for
video analysis, based on the phase information of the video. With the saliency
map computed using the proposed method, we analyze how it can be used for
two fundamental vision tasks, namely abnormality detection and spatiotemporal
interest point detection. We evaluate the performance of the proposed algorithm
on several widely used datasets and compare it with the state of the art in
the literature.

The proposed method, compared with the existing work on spatiotemporal
saliency in the literature, has several advantages. First, it computes the saliency
information from the entire video span, which is different from many existing
approaches in the literature. For example, [7] computes temporal information
only from the differences of two adjacent frames, which is insufficient for modeling
complex activities, as shown in the experiments. Second, the proposed approach
is easy to implement and computationally efficient. The core part of the
algorithm involves only a three-dimensional Fourier transform, whose complexity
is only O(N log N), where N is the size of the input. Last but not least, no
training stage or prior information is needed for the proposed approach, which
is a significant advantage for applications like abnormality detection.

The rest of the paper is organized as follows: in Sec. 2 we describe the
proposed method, including the analysis and the relationships with existing
methods; Sec. 3 evaluates the proposed spatiotemporal saliency detector on
saliency detection with both a synthetic dataset and two real video datasets; studies
of how the spatiotemporal saliency computed by the proposed method can be
used for two important vision tasks, abnormality detection and spatiotemporal
interest point detection, are presented in Sec. 4; and the paper is concluded in
Sec. 5.

2 Proposed Method 

As reviewed above, spectrum-analysis-based approaches to visual saliency have
seen some success, although existing work has primarily addressed predicting
salient objects in a given (static) image. For a dynamic scene, temporal
information should be taken into consideration to properly predict the salient
objects. For example, it was found in [13] that objects attract more visual
attention if they move differently from their neighbors. Considering this, we propose
to compute the saliency map of dynamic scenes by utilizing the phase information
of the temporal domain together with the phase information of the spatial
domain. In the proposed method, we compute the saliency map for 3D data
$X \in \mathbb{R}^{M \times N \times T}$ as

$$Z = \left| \mathcal{F}^{-1}\left[ \frac{Y}{|Y|} \right] \right|^2 \tag{1}$$

where $Y = \mathcal{F}(X)$, $\mathcal{F}$ is the 3D discrete Fourier transform and
$\mathcal{F}^{-1}$ is the corresponding inverse transform. After we get the saliency
map, we smooth it with a 3D Gaussian filter. The 3D Fourier transform can be computed as:


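$$
\begin{aligned}
Y(u,v,w) &= \sum_{i}\sum_{j}\sum_{t} X(i,j,t)\, e^{-2\pi\sqrt{-1}\left(\frac{ui}{M}+\frac{vj}{N}+\frac{wt}{T}\right)} \\
&= \sum_{i} e^{-2\pi\sqrt{-1}\,\frac{ui}{M}} \sum_{j} e^{-2\pi\sqrt{-1}\,\frac{vj}{N}} \sum_{t} X(i,j,t)\, e^{-2\pi\sqrt{-1}\,\frac{wt}{T}}
\end{aligned}
\tag{2}
$$

i.e., the 3D Fourier transform can be computed as a sequence of 1D Fourier
transforms on each coordinate.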
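
The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function names, the small constant `eps` that guards against division by zero, and the separable Gaussian smoothing with zero padding are assumptions made for the sketch.

```python
import numpy as np

def smooth_3d(volume, sigma):
    """Separable Gaussian smoothing along every axis (zero padding at borders)."""
    out = volume.astype(float)
    for axis in range(out.ndim):
        n = out.shape[axis]
        # Keep the kernel no longer than the axis so mode="same" preserves the shape.
        radius = min(int(3 * sigma), (n - 1) // 2)
        x = np.arange(-radius, radius + 1)
        kernel = np.exp(-0.5 * (x / sigma) ** 2)
        kernel /= kernel.sum()
        out = np.apply_along_axis(np.convolve, axis, out, kernel, mode="same")
    return out

def phase_saliency(video, sigma=2.0, eps=1e-12):
    """Phase-only saliency of a 3D array (rows x columns x frames)."""
    Y = np.fft.fftn(video)                      # 3D Fourier transform
    phase_only = Y / (np.abs(Y) + eps)          # keep the phase, set magnitude to one
    Z = np.abs(np.fft.ifftn(phase_only)) ** 2   # squared modulus of the inverse transform
    return smooth_3d(Z, sigma)                  # Gaussian smoothing of the saliency map
```

For a volume made of a periodic pattern plus one isolated spike, the map returned by `phase_saliency` peaks at the spike, which matches the intuition that periodic content is suppressed once the magnitude is normalized.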

The proposed method detects spatiotemporal saliency, which has also been
discussed in some existing works. For example, in [7], the detection was done by
combining the color information of one frame and the differences between this frame
and the previous one with a quaternion Fourier transform. As a result, the temporal
information is limited to two adjacent frames and is insufficient for modeling
complex scenes. On the other hand, the spatiotemporal saliency proposed in
this paper considers the temporal information over a long temporal span, up to
the entire video.

The method in Eq. (1) evaluates the saliency of a region by exploring the
information of the entire video. In some situations, we may also be interested
in detecting a region that is salient within a temporal window of the video.
For example, if a video contains multiple scenes, each capturing a different
activity, we may be more interested in analyzing the saliency within each scene
instead of the entire video. For this reason, we propose multiscale analysis for
spatiotemporal saliency, which is inspired by the short-time Fourier transform. We
first apply the window function to the input signal, $X \cdot w(i,j,t)$, where $\cdot$ is
element-wise multiplication and $w(i,j,t)$ is the window function centered at
position $(i,j,t)$, which is nonzero only on a small support (i.e., the size of the
window function). The saliency map is computed for the windowed signal:

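$$
\begin{aligned}
Y &= \mathcal{F}\left[ X \cdot w(i,j,t) \right] \\
Z &= \left| \mathcal{F}^{-1}\left[ \frac{Y}{|Y|} \right] \right|^2
\end{aligned}
\tag{3}
$$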
By sliding the window function over the input video, we still obtain the saliency
map for the entire video. The size of the sliding window determines the temporal
resolution: with a larger window, more global information of the input is revealed
but the resolution is lower; with a smaller window, the resolution is improved. The
window function can be applied in the temporal direction, the spatial directions
or both. As a result, we can perform saliency detection at varying scales,
which enables us to reveal information at different spatiotemporal resolutions,
similar to the short-time Fourier transform.
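
As an illustration of the windowed analysis, the sketch below applies a rectangular window along the temporal direction only. It assumes non-overlapping windows (the hop equals the window length), which is one simple way to cover the whole video; the function names and the `eps` guard are likewise assumptions of the sketch, and smoothing is omitted for brevity.

```python
import numpy as np

def phase_map(block, eps=1e-12):
    """Phase-only saliency of one block (no smoothing)."""
    Y = np.fft.fftn(block)
    return np.abs(np.fft.ifftn(Y / (np.abs(Y) + eps))) ** 2

def windowed_saliency(video, window):
    """Saliency with a rectangular temporal window of `window` frames.

    video: array of shape (rows, columns, frames).
    Assumption: windows do not overlap; the last one may be shorter.
    """
    n_frames = video.shape[2]
    Z = np.empty(video.shape, dtype=float)
    for start in range(0, n_frames, window):
        stop = min(start + window, n_frames)
        # Multiplying by a rectangular window equals cutting out the frames.
        Z[:, :, start:stop] = phase_map(video[:, :, start:stop])
    return Z
```

A smaller `window` localizes saliency in time at the price of less temporal context per block, which mirrors the resolution trade-off discussed above.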

Combining different visual cues is important not only for scene saliency but
also for spatiotemporal saliency. In this paper, we propose to compute the saliency
map for each cue independently and then sum the saliency maps
from all visual cues. In [7], the quaternion Fourier transform (QFT) is utilized to
combine the three-channel color information and the frame differences. However,
the QFT can be very expensive (e.g., time consuming) when applied in the
spatiotemporal domain. In fact, we find (Appendix A) that, given data with four
feature channels, the saliency map computed with the QFT is very similar to the
sum of the saliency maps computed with the FFT over each feature channel.
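
The per-cue combination can be sketched as follows. The channel-last array layout, the function names and the optional `weights` argument are assumptions of this illustration; with the default weights it reduces to the plain summation described above.

```python
import numpy as np

def phase_map(x, eps=1e-12):
    """Phase-only saliency of one feature channel (no smoothing)."""
    Y = np.fft.fftn(x)
    return np.abs(np.fft.ifftn(Y / (np.abs(Y) + eps))) ** 2

def multichannel_saliency(video, weights=None):
    """Sum of per-channel saliency maps.

    video: array of shape (rows, columns, frames, channels).
    weights: optional per-channel weights (hypothetical extension; default all ones).
    """
    n_channels = video.shape[-1]
    if weights is None:
        weights = np.ones(n_channels)
    weights = np.asarray(weights, dtype=float)
    total = np.zeros(video.shape[:-1], dtype=float)
    for k in range(n_channels):
        total += weights[k] * phase_map(video[..., k])
    return total
```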

Finally, we summarize the proposed algorithm below: 


Algorithm

Input: data $X$, Gaussian filter $g$, window function $w$
Output: saliency map $Z$
For each window location
    For each feature channel
        Apply $w$ to the input $X$;
        Compute the Fourier coefficients $Y = \mathcal{F}[X]$;
        Extract the phase information $Y = Y / |Y|$;
        Do the inverse transform $Z = \left| \mathcal{F}^{-1}[Y] \right|^2$;
        Smooth the saliency map $Z = Z * g$;
    End
    Combine the $Z$ of all channels together;
End


where $w$ is the window function. Currently, we apply the window function only
along the temporal direction and use a rectangular window. The size of the window
depends on the data. By incorporating the phase information of the temporal
domain, the proposed method can suppress not only the background, as achieved
by visual saliency for images, but also objects that are static or
moving “regularly”.

2.1 Analysis 

There have been several explanations for why spectral-domain-based approaches are
able to detect salient regions in an image. For example, [14] explained it by
biological plausibility: a saliency map exists in the primary visual cortex (V1),
which is orientation selective and exhibits lateral surround inhibition [?]. The spectral
magnitude measures the total response of cells tuned to a specific frequency
and orientation. According to lateral surround inhibition, similarly tuned cells
are suppressed depending on their total response, which can be modeled
by dividing the spectrum by the spectral magnitude [?]. [8] provided another
explanation from sparse representation, which states that, if the foreground is
sparse in the spatial domain and the background is sparse in the DCT domain (e.g.,
periodic textures), the spectral-domain-based approach will highlight the foreground
region in the saliency map.

Motion, like color and texture, is also perceptually salient. [?] studied how
three properties of motion, namely flicker, direction and velocity, contribute to
this saliency. When the target object has a different flicker rate, moving
direction or motion velocity from the other objects, it can be
easily identified by human subjects, i.e., it is salient. In the spectral domain, the target
object and the other objects are mapped to two different bands (frequency and
orientation), where the band corresponding to the target object has a much
lower response than the band for the other objects. Thus, if we set the magnitude
of the spectrum to one, as the proposed method does, the band for the other
objects is suppressed more than that of the target object, which makes the target
object “pop out” in the output. In Sec. 3.1 we verify this analysis with
experiments on synthetic data.

2.2 Relationship to Existing Works 

Our method is related to some existing works. Based on the way in which
they compute the temporal information, we can roughly divide them into two
categories:

1. Methods of the first category represent the temporal information by
motion, e.g., by frame differences [9] or by more dedicated motion estimation
methods including homography of adjacent frames [18] and phase
correlation [19]. However, methods using this kind of temporal information typically
have a limited temporal span, e.g., two adjacent frames ([?] tried to compute
the frame differences of frames at a predefined set of temporal spans),
thus they are not sufficient for modeling complex motion patterns.

2. In the second category, the saliency of a spatiotemporal cuboid (referred to as a
cuboid below) is measured by the “differences” of this cuboid to other cuboids of
the video or to the templates in a dictionary, which may incur high computational
cost and/or require additional training data. The “differences”
of cuboids can be measured by distances [20], relative entropy [21], mutual
information [22] and coding length increments [23].

The proposed method is different from these methods. First, it does not rely
on prior knowledge. Instead, it explores within the input video to detect the
potential “outliers”. Second, the “outliers” are found by exploring the whole
temporal span. This enables the proposed algorithm to detect salient
patterns against a complex dynamic background. In addition, the proposed method
has low computational cost and is easy to implement. The Fourier transform for
multi-dimensional data can be computed as a sequence of 1D Fourier transforms
on each coordinate of the data, thus the computational cost of the 3D Fourier
transform for data $X \in \mathbb{R}^{M \times N \times T}$ is $O(MNT \log(MNT))$. Thus the total
computational cost of the proposed algorithm is $O(KMNT \log(MNT))$, where $K$ is
the number of feature channels.
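
The separability that underlies this cost estimate can be checked numerically. The short sketch below (the function name is chosen for illustration) applies NumPy's 1D FFT along each axis in turn and agrees with the direct 3D transform up to floating-point error.

```python
import numpy as np

def fft3_separable(x):
    """3D DFT computed as three passes of 1D FFTs, one per axis."""
    y = np.fft.fft(x, axis=0)   # 1D transforms along rows
    y = np.fft.fft(y, axis=1)   # then along columns
    y = np.fft.fft(y, axis=2)   # then along time
    return y

rng = np.random.default_rng(0)
x = rng.random((4, 5, 6))
same = np.allclose(fft3_separable(x), np.fft.fftn(x))
```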


3 Experiment 

In this section, we evaluate the proposed method on saliency detection with both
synthetic data (Sec. 3.1) and two real video datasets (Sec. 3.2), CRCNS-ORIG
and DIEM. The performance of the proposed method is compared with
existing methods, some of which are state of the art.

3.1 Simulation Experiment 

In this section, we evaluate the proposed method on synthetic data. In [?], it was
studied how three properties of motion, namely flicker, direction and velocity,
contribute to saliency. We generate the synthetic data according
to their protocol. The input data is a short clip with a resolution
of 174 x 174 and 400 frames at a frame rate of 60 frames per second. We
put 36 objects of size 5 x 13 in a 6 x 6 grid and a target object is randomly
selected out of those 36 objects. All the objects are allowed to move within a
29 x 29 region centered at their initial position (and are wrapped back if they move
out of this region). The video is black-and-white. We design the following three
experiments:

1. Flicker: we switch the objects on and off at a specified rate and the target
object at a different rate from the other 35 objects;

2. Direction: we set the objects moving in a specified direction and the
target object in a different direction. The velocity of all the objects is
the same;

3. Velocity: we set the objects moving at a specified velocity and the target
object at a different velocity. The moving direction of all the
objects is the same.

All the other parameters are the same as used in [?]. According to [?], the
target object can be easily identified by human subjects when its motion
property (e.g., flicker rate, moving direction, velocity) is different from that of the
other objects. We also include some “blind” trials, where the target object has the
same motion property as the other 35 objects. In this case, the target object
cannot be identified by the human subjects, i.e., there is no salient region.
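
To make the protocol concrete, the following sketch generates a clip for the flicker experiment. Several details are assumptions of the sketch and are not specified above: rates are taken to be in Hz, flicker is modeled as a square wave with equal on and off durations, each object is 5 pixels wide and 13 pixels tall, and objects sit at the centers of their 29 x 29 cells. The function name and return values are hypothetical.

```python
import numpy as np

def flicker_clip(rate_hz, target_rate_hz, n_frames=400, fps=60, seed=0):
    """Binary clip with 36 flickering bars; one random target flickers at another rate."""
    rng = np.random.default_rng(seed)
    grid, cell = 6, 29                  # 6 x 6 grid of 29 x 29 cells -> 174 x 174 frame
    obj_h, obj_w = 13, 5                # assumed orientation of the 5 x 13 object
    clip = np.zeros((grid * cell, grid * cell, n_frames), dtype=np.uint8)
    mask = np.zeros((grid * cell, grid * cell), dtype=bool)
    target = int(rng.integers(grid * grid))
    frames = np.arange(n_frames)
    for idx in range(grid * grid):
        r, c = divmod(idx, grid)
        rate = target_rate_hz if idx == target else rate_hz
        # Square wave: the object toggles every half period of 1 / rate seconds.
        on = ((2 * rate * frames) // fps) % 2 == 0
        top = r * cell + (cell - obj_h) // 2
        left = c * cell + (cell - obj_w) // 2
        clip[top:top + obj_h, left:left + obj_w, :] = on   # broadcast over the object
        if idx == target:
            mask[top:top + obj_h, left:left + obj_w] = True
    return clip, mask, target
```

A “blind” trial corresponds to calling the function with `target_rate_hz` equal to `rate_hz`.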

We apply the proposed method to the input data. For comparison, we also
evaluate the methods proposed in [7] and [8]. We use the area under the receiver
operating characteristic curve (AUC) as the performance metric. The ground truth
mask is generated according to the location of the target object. The experiment
result is shown in Fig. 1.
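
For reference, the AUC of a saliency map against a binary ground truth mask can be computed from ranks, as in the sketch below. This is one standard way to compute the metric (the Mann-Whitney formulation with average ranks for ties) and is not necessarily the evaluation code used for the reported numbers; the function name is chosen for illustration.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    # Average rank (1-based) of every score, with tied scores sharing one rank.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    average_rank = last_rank - (counts - 1) / 2.0
    ranks = average_rank[inverse]
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

A constant map gives 0.5, which is the chance level expected in a trial without any salient region.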

Figure 1: The AUC on the synthetic data for the proposed method and two
existing methods. For “Direction” and “Velocity”, we also include some “blind”
trials (X-axis has value 0), where the target object has exactly the same motion
property as the other 35 objects. In those trials, the target object cannot be
identified by human subjects, i.e., there is no salient object [?].

Figure 2: Some visual samples of the synthetic data for different experiments.

From the experiment results, we find that the proposed method detects
the salient region much more accurately than [7] and [8] in all except the “blind”
trials. For the “blind” trials, the AUC of the proposed method drops
significantly, which shows that the proposed method is also robust. In contrast, [7]
and [8] fail in those “blind” trials. Surprisingly, [7] and [8] achieve quite
similar performance, though [7] was expected to achieve better results as it
includes the differences of two adjacent frames as motion (temporal) information.

3.2 Spatiotemporal Saliency Detection

In the previous section, we tested the proposed spatiotemporal saliency detector on
synthetic videos and compared it with two other saliency detectors;
the proposed detector showed better performance in capturing the temporal
information. In this section, we evaluate the proposed spatiotemporal saliency
detector on two challenging video datasets for saliency evaluation, CRCNS-ORIG
[24] and DIEM [25]. For this experiment, we first convert each frame into
the LAB color space, then compute the spatiotemporal saliency in each channel
independently; the final spatiotemporal saliency is the sum of the
saliency maps of all three channels.

CRCNS-ORIG includes 50 video clips from different genres, including TV
programs, outdoor scenes and video games. Each clip is 6 to 90 seconds
long at 30 frames per second. The eye fixation data is captured from eight subjects
with normal or corrected-to-normal vision. In our experiment, we downsample
the video from 640 x 480 to 160 x 120 and keep the frame rate untouched, then
apply our spatiotemporal saliency detector. To measure the performance, we
compute the area under curve (AUC) and F-measure (harmonic mean of true
positive rate and false positive rate). The experiment result is shown in Fig. 3,
where the AUC is 0.6639 and the F-measure is 0.1926. Tab. 1
compares the result of the proposed method with some state-of-the-art methods on
CRCNS-ORIG, which indicates that our method outperforms them by at least
0.06 in AUC. The per-video AUC score is shown in Fig. 9 in Appendix A.1.



Figure 3: The receiver operating characteristic curves of the proposed method on
the CRCNS-ORIG dataset and the DIEM dataset. The areas under the curve are 0.6639
and 0.6896, respectively.

DIEM dataset collects data of where people look during dynamic scene view¬ 
ing such as film trailers, music videos, or advertisements. It currently consists 








CRCNS-ORIG                     DIEM
Method           AUC           Method           AUC
AWS [?]          0.6000        AWS [?]          0.5770
HouNIPS [27]     0.5967        Bian [?]         0.5730
Bian [?]         0.5950        Marat [28]       0.5730
IO               0.5950        Judd [29]        0.5700
SR [6]           0.5867        AIM [30]         0.5680
Torralba [31]    0.5833        HouNIPS [27]     0.5630
Judd [29]        0.5833        Torralba [31]    0.5840
Marat [28]       0.5833        GBVS [32]        0.5620
Rarity-G [33]    0.5767        SR [6]           0.5610
CIOFM [34]       0.5767        CIO [?]          0.5560
Proposed         0.6639        Proposed         0.6896

Table 1: The results of the proposed method compared with the results of the
top ten existing methods on the CRCNS-ORIG dataset (left) and the DIEM dataset (right)
according to [35]. From this table, we can see that the proposed method obtains
clearly better performance than the state of the art on both datasets.

of data from over 250 participants watching 85 different videos. Each video in
the DIEM dataset includes 1000 to 6000 frames at 30 frames per second. As with
CRCNS-ORIG, we downsample the video to 1/4 (e.g., from 1280 x 720 to 320 x 180)
while maintaining the aspect ratio and frame rate. We observe that each video
in the DIEM dataset consists of a sequence of short clips, where each clip has 30
to 100 frames. To properly detect the saliency in those videos, we apply the
window function in our spatiotemporal saliency detector, where the size of the
window (along the temporal direction) is 60 frames. The experiment result is shown
in Fig. 3 and Tab. 1, where the AUC is 0.6896 and the F-measure is 0.35. From the
table, we find that the proposed method outperforms the state of the art by
over 10%. The per-video AUC score is shown in Fig. 10 in Appendix A.1.


4 Application of Spatiotemporal Saliency 

In the previous section, we showed that the proposed method is able to detect the
salient regions in a video. Saliency detection for images has been used more
and more in other visual tasks, e.g., image segmentation and object recognition.
A natural question is whether we can also apply spatiotemporal saliency
detection to some important vision tasks. In this section, we show how
the spatiotemporal saliency computed by the proposed method can be applied to
two important vision tasks, i.e., abnormality detection (Sec. 4.1) and spatiotemporal
interest point detection (Sec. 4.2).

4.1 Abnormality Detection 

According to our previous analysis, a salient region should be different from
its neighbors, both spatially and temporally. This spatiotemporal saliency shares

Method                        AUC
Optical flow [37]             0.84
Social force [37]             0.96
Chaotic invariants [40]       0.99
NN [38]                       0.93
Sparse reconstruction [38]    0.978
Interaction force [41]        0.9961
Proposed                      0.9378


Table 2: The results on the UMN dataset. Note that we have cropped out the region
that contains the text “abnormal”, which results in a frame resolution of 214 x 320.
Please note that most of those methods, except the proposed one, need a training
stage.

a lot in common with the concept of abnormality in video. Thus in this section,
we show how we can utilize the proposed spatiotemporal saliency detector to
detect abnormality in video.

For abnormality detection, we start by computing the saliency map for
the input video as described above. The regions containing abnormalities can
be detected by finding the regions where the saliency value is above a threshold.
Then the saliency score of a frame is computed as the average of the saliency values
of the pixels in that frame, i.e.,

$$s(t) = \frac{1}{NM}\sum_{i}\sum_{j} Z(i,j,t) \tag{4}$$

where $s(t)$ is the saliency score of the $t$-th frame, $N \times M$ is the size of one frame, and
$i$, $j$, $t$ are the row, column and frame indices of the 3D saliency map, respectively. A
frame with a high saliency score is likely to contain an abnormality.
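
Given a saliency volume, the frame score and the two thresholding steps can be sketched as below. The function names are chosen for illustration, and the pixel-level rule (a multiple of the mean saliency of the video, with a factor of four as in the caption of Fig. 7) is exposed as a parameter since the appropriate threshold depends on the data.

```python
import numpy as np

def frame_scores(Z):
    """Average saliency of every frame; Z has shape (rows, columns, frames)."""
    return Z.mean(axis=(0, 1))

def abnormal_frames(Z, threshold):
    """Boolean flag per frame: True where the frame score exceeds the threshold."""
    return frame_scores(Z) > threshold

def abnormal_pixels(Z, factor=4.0):
    """Pixels whose saliency exceeds `factor` times the mean saliency of the video."""
    return Z > factor * Z.mean()
```

Sweeping `threshold` over the range of frame scores yields the frame-level ROC curve from which AUC or equal-error rate can be derived.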

We evaluate the proposed method for abnormality detection on videos from
two datasets: the UMN abnormal dataset¹ and the UCSD dataset [36]. Abnormality
detection has attracted a lot of effort from researchers. However, most of the
existing works require a training stage, e.g., social force [37], sparse reconstruction
[38] and MPPCA [35], i.e., they need training data to initialize the model. The
proposed method, in contrast, does not need any training stage or training data.

The result on the UMN abnormal dataset is shown in Tab. 2, where we compute
the frame-level true positive rate and false positive rate and then the area
under the ROC curve (Fig. 4). Fig. 5 shows the result for videos of three scenes, where
we plot the saliency value of each frame and show some sample frames. The result on
the UCSD dataset is shown in Tab. 3, where we report the frame-level equal-error rate
(EER) [36]. Fig. 6 shows the ROC for the UCSD dataset with the proposed method;
Fig. 7 shows eight sample frames, where red highlights the abnormal regions.
We find that, without training data, the proposed method still outperforms
several state-of-the-art methods in the literature, e.g., social force and MPPCA.

¹ http://mha.cs.umn.edu/Movies/Crowd-Activity-All.avi

Figure 4: The ROC for the UMN dataset computed with the proposed method.


Method               Ped1      Ped2      Overall
Social force [37]    31%       42%       37%
MPPCA [35]           40%       30%       35%
MDT [35]             25%       25%       25%
Adam [42]            38%       42%       40%
Reddy [43]           22.5%     20%       21.25%
Sparse [38]          19%       N.A.      N.A.
Proposed             27%       19%       23%


Table 3: The frame-level EER (the lower the better) for the UCSD dataset. Please
note that most of those methods, except the proposed one, need a training
stage. From the results, we find that the proposed method, even without a
training stage or training data, can still outperform social force and MPPCA.

4.2 Spatiotemporal Saliency Point Detector 

The regions that attract human attention most contribute most to
people’s perception of the scene. The saliency map computed with the proposed
method highlights those regions. Thus we propose to sample interest
points based on the saliency map of the data, which we refer to as the spatiotemporal
saliency point detector (STSP).

To detect interest points, we again start by computing the saliency map $Z$
for the input data $X$. Then we apply non-maximum suppression on the saliency
map to sample the interest points: an interest point is selected at $(x, y, t)$ if and
only if

$$
\begin{aligned}
Z(x,y,t) &> p \\
Z(x,y,t) &> Z(i,j,k) \quad \forall (i,j,k) \in N(x,y,t)
\end{aligned}
\tag{5}
$$

Figure 5: Some sample results for the UMN dataset, where we pick one video
for each scene (panels: Scene 1, Scene 2, Scene 3). The top is the saliency value
(Y-axis) for each frame (X-axis) and the bottom shows sample frames picked from
different positions (as shown by the arrows).

Figure 6: The ROC for the UCSD dataset computed with the proposed method.

Figure 7: Some sample results for the UCSD dataset (panels include Peds1:
Wheelchair, Peds2: Skater, Peds2: Bike), where red highlights the detected
abnormal region, i.e., pixels whose saliency value is higher
than four times the mean saliency value of the video.

where $p$ is a predefined threshold (e.g., 2p) and $N(x,y,t)$ is the set of positions
near $(x, y, t)$.
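
The selection rule can be sketched with plain NumPy as follows. The neighborhood is assumed here to be a cube of half-width `radius` around each position, which is one possible reading of "positions near (x, y, t)"; the function name and the treatment of borders (positions outside the volume never block a maximum) are also assumptions of the sketch.

```python
import itertools
import numpy as np

def saliency_points(Z, threshold, radius=1):
    """Return positions that exceed `threshold` and all neighbors in a cube."""
    Z = np.asarray(Z, dtype=float)
    padded = np.pad(Z, radius, mode="constant", constant_values=-np.inf)
    keep = Z > threshold
    offsets = range(-radius, radius + 1)
    for dx, dy, dt in itertools.product(offsets, repeat=3):
        if dx == 0 and dy == 0 and dt == 0:
            continue  # skip the position itself
        neighbor = padded[
            radius + dx: radius + dx + Z.shape[0],
            radius + dy: radius + dy + Z.shape[1],
            radius + dt: radius + dt + Z.shape[2],
        ]
        keep &= Z > neighbor   # strict local maximum
    return np.argwhere(keep)   # one (x, y, t) row per interest point
```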

Similar to [44], for each interest point $(x, y, t)$, we extract a descriptor within
its neighborhood characterized by $(x, y, t, \sigma, \tau)$, where $(x, y, t)$ is the center and $\sigma$,
$\tau$ are the spatial and temporal scales, respectively (we use 18 x 18 x 10, 25 x 25 x 14 and
36 x 36 x 20 here). The neighborhood is further divided into multiple
subblocks (e.g., 3 x 3 x 2 along the spatial and temporal directions, respectively); for
each subblock, we compute the 3D gradient $g = [g_x, g_y, g_t]$; then we quantize
the orientations of the gradients into a histogram of four bins; finally the histogram
of each subblock is normalized to unit $\ell_1$ norm and the histograms of all
subblocks are concatenated into one histogram, i.e., the descriptor for interest
point $(x, y, t)$.

Existing spatiotemporal interest point detectors mostly
choose locations where the gradient is strong and stable across different scales.
However, the gradient is low-level information and is insufficient to capture
complex dynamics as human vision does. Instead, the proposed method
explores the relationship of each location over the whole spatial and temporal span,
and is thus able to model complex dynamics in the video.

For evaluation, we use three datasets: the Weizmann dataset [46], the KTH dataset
[46] and the UCF sports dataset [?]. Since the method is proposed for detecting
interest points, we only compare it with several state-of-the-art spatiotemporal
interest point detectors including Harris3D [44], Gabor [45], Hessian3D [49] and
dense sampling [?], whose results are summarized in [50]. The parameters
of the detectors are set as suggested in the corresponding papers.

Fig. 8 shows the saliency maps for some sample frames of videos from the UCF
sports action dataset and the KTH dataset. From the figure, we find that
the saliency map computed with the proposed method highlights the moving
regions while suppressing the background. The proposed method is also robust
to moving background (Row 1), cluttered background (Row 2) and scale variation
(Row 3). In addition, in Rows 3 and 4, we find that the moving parts of the
body, e.g., the hands, get higher saliency values (red) than other body parts.
The spatiotemporal saliency interest points are mostly sampled from those
highlighted regions and augment the description of the action of interest.

To quantitatively evaluate the performance of the different detectors, we use
the interest points detected by each detector to train a classifier for activity
recognition. We use both histogram of gradient (HoG) and histogram of optical
flow (HoF) as the descriptor. Bag of words is used to represent the video, where
each input is represented as a histogram of words in the codebook (the size of the
codebook is k = 2000); then a classifier (support vector machine with $\chi^2$ kernel)
is applied to those histograms to classify the input. For the Weizmann dataset and
the UCF sports dataset, we use a leave-one-out scheme for training and testing; for
the KTH dataset, we follow the standard partition in [46].
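
The representation step can be sketched as follows. The codebook is assumed to be given (e.g., obtained by clustering training descriptors), words are assigned by nearest Euclidean distance, and the kernel uses the common exponential form exp(-gamma * sum((x - y)^2 / (x + y))); the exact kernel variant, `gamma`, and the function names are assumptions of the sketch rather than details stated above.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Normalized histogram of nearest codewords for one video."""
    # Squared Euclidean distance between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

def chi2_kernel(A, B, gamma=1.0, eps=1e-12):
    """Exponential chi-square kernel between rows of A and rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    summ = A[:, None, :] + B[None, :, :]
    dist = (diff ** 2 / (summ + eps)).sum(axis=-1)
    return np.exp(-gamma * dist)
```

The resulting kernel matrix can be passed to any support vector machine implementation that accepts precomputed kernels.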

Tab. 4 reports the performance of the different detectors on the three datasets, where
we test extracting features from the original video and also extracting features from
the saliency map of the original video (referred to as “proposed*”). From the table we
find that the proposed method (and “proposed*”) achieves the best result on
all three datasets. In particular, “proposed*” achieves the best results on the KTH
dataset and the Weizmann dataset; “proposed” achieves the best result on the UCF
sports action dataset. For videos with simple background (e.g., KTH dataset and

Method       Weizmann    KTH       UCF sports
Harris3D     85.6%       91.8%     78.1%
Gabor        N.A.        88.7%     77.7%
Hessian3D    N.A.        88.7%     79.3%
Dense        N.A.        86.1%     81.6%
Proposed     84.5%       88.0%     86.7%
Proposed*    95.6%       92.6%     85.6%


Table 4: The performance of different detectors on three datasets. For “proposed*”,
we extract the descriptor on the saliency map instead of on the video.

Weizmann dataset), extracting the descriptor on the saliency map instead of the video
itself could be a better choice.



Figure 8: Some sample frames (left) from the UCF sports action dataset (Rows 1,
2) and the KTH dataset (Rows 3, 4) with their saliency maps (right).

5 Conclusion and Discussion


In this paper, we proposed a novel approach for detecting spatiotemporal saliency,
which is simple to implement and computationally efficient. The proposed approach
was inspired by recent developments in spectrum-analysis-based visual
saliency approaches, where phase information is used for constructing the
saliency map of the image. Recognizing that the computed saliency map captures
the regions of human attention in dynamic scenes, we proposed two
algorithms utilizing this saliency map for two important vision tasks. These
approaches were evaluated on several well-known datasets with comparisons to the
state of the art in the literature, where good results were demonstrated. In
future work, we will focus on the theoretical analysis of the proposed method and
on the selection of the window function.

A Appendix 

To compare the performance of combining four visual cues via QFT with that of
summing the saliency maps of the individual visual cues, we design the
following experiment. We run 1000 simulations and in each simulation we generate
an r x c x 4 array, where r and c are random numbers in [1, 1000] and
4 is the number of feature channels. We compute the saliency map with the two
methods and then measure their similarity via cross-correlation, where 0.91 is
obtained for QFT and FFT. After smoothing the saliency maps with a Gaussian
kernel, the correlation is over 0.998. For natural images, we could expect an even
higher correlation.

This suggests that we can compute the saliency map for each visual cue
independently and then add them together, which yields a result quite similar
to that of the quaternion Fourier transform. In addition, the proposed method, unlike
QFT, provides more flexibility, e.g., we can assign different weights to the
visual cues as in [29].

A.1 Supplementary Results

We also include the AUC of the proposed method for each video from the
CRCNS-ORIG dataset (Fig. 9) and the DIEM dataset (Fig. 10).